adelie_shuffled <- withr::with_seed(2,
  adelie_bills |>
    specify(bill_length_mm ~ sex) |>
    hypothesize(null = "independence") |>
    infer::generate(reps = 1000, type = "permute")
)

Introduction to statistics
2024-10-22
In science we collect data to test hypotheses
We use statistics to quantify evidence for or against hypotheses
We quantify evidence through the building of statistical models
Models are a simplification of the real, complex system we are studying
Statistical inference uses probabilities to make decisions about parameters of interest in a population, based on statistics (estimates) computed from a subset, or sample, of that population
Are the population means of two groups, control & treated, different?
Key concepts
Interested in the efficacy of a new drug for treating disease in animals
The population of interest is all the animals with the disease across herds in a sector
We could recruit the farmers managing all the herds in the sector into a study, and give each animal either the new drug or a control
Measure the recovery time of the two groups
Problems are: the cost and time involved, and the practical difficulty of recruiting every herd and treating every animal
An alternative approach is to sample from the population of interest
Take a random, representative subset of animals (or herds) from the population of interest
How representative the sample is depends on how it was collected
Many statistical methods assume that the data represent a random sample from a population
Difficult to do in practice
Alternatively we could use multi-stage cluster sampling: select \(n\) representative herds and then randomly sample individuals from within those herds
Statistics computed on samples will rarely be exactly equal to the population values
Is there evidence of sexual dimorphism in Adelie penguin bill lengths?
A distinction is often made between parameters and statistics
A sample statistic is an estimate of a population parameter
We often use the terms statistic and parameter estimate interchangeably
Probabilities describe the chance that an event will occur
Probabilities range from 0 to 1, with smaller values indicating that the event is unlikely and larger values indicating that it is likely
Compute probabilities as \(\displaystyle \frac{m}{M}\), where \(m\) is the number of possible outcomes of interest and \(M\) is the number of all possible outcomes
Probability of a head on a single coin toss: \(\frac{1}{2} = 0.5\)
Probability of rolling an odd number on a 10-sided die: \(\frac{5}{10} = 0.5\)
Probability of rolling a value less than 3 on a 10-sided die: \(\frac{2}{10} = 0.2\)
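The frequency definition above can also be checked by simulation; a minimal base-R sketch (the die and the number of rolls are illustrative choices):

```r
# Estimate P(roll < 3) on a fair 10-sided die by simulation
set.seed(1)                                # fix the RNG for reproducibility
rolls <- sample(1:10, size = 100000, replace = TRUE)
p_hat <- mean(rolls < 3)                   # proportion of rolls that are 1 or 2
p_hat                                      # close to the exact 2/10 = 0.2
```

With 100,000 rolls the simulated proportion should be very close to the exact probability of 0.2.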
What is the probability of randomly selecting two samples with means as different as observed if the samples came from the same population?
Need to consider the sampling error:
Earlier we saw a difference of means of the bill lengths of male and female Adelie penguins of 3.18
How likely is a difference of means of >3? How likely is a difference of means of >0.5?
Read off from the cumulative distribution function of the sampling distribution of statistic
Often, the null hypothesis, \(\text{H}_0\), is that there is no effect
\(\text{H}_0\): there is no effect of sex on the bill length of Adelie penguins
This means that, under \(\text{H}_0\) we can rearrange the sex variable randomly, pairing any value of bill_length_mm with a male or female penguin
We typically denote the alternative hypothesis as \(\text{H}_1\) or \(\text{H}_{\text{a}}\)
This hypothesis is usually what we are interested in knowing about, but we can’t test it explicitly
Under \(\text{H}_0\), these shufflings of the data should look similar to our observed data
Do the shuffled data look like the observed data?
What we just did is the basis of a permutation test of \(\text{H}_0\)
A real test would shuffle — permute — the data many more times than 4
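The whole procedure can be sketched in base R. This is a minimal illustration on simulated data: the group means and spread are assumed values (loosely based on the sample summaries), not the real penguin measurements.

```r
# Permutation test for a difference of means, in base R
set.seed(2)
bill <- c(rnorm(20, mean = 37.4, sd = 1.9),   # "female" values (assumed)
          rnorm(20, mean = 40.5, sd = 1.9))   # "male" values (assumed)
sex  <- rep(c("female", "male"), each = 20)

# Observed test statistic: difference of group means
obs <- mean(bill[sex == "male"]) - mean(bill[sex == "female"])

# Under H0, the sex labels are exchangeable, so shuffle them many times
perm <- replicate(1000, {
  shuffled <- sample(sex)                     # one random permutation of labels
  mean(bill[shuffled == "male"]) - mean(bill[shuffled == "female"])
})

# Two-sided p-value: how often is a shuffled difference as extreme as observed?
p_value <- mean(abs(perm) >= abs(obs))
```

The infer pipeline shown on the following slides does exactly this, but with a tidier, more declarative interface.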
Response: bill_length_mm (numeric)
Explanatory: sex (factor)
Null Hypothesis: independence
# A tibble: 40,000 × 3
# Groups: replicate [1,000]
bill_length_mm sex replicate
<dbl> <fct> <int>
1 41.1 female 1
2 39.6 female 1
3 36.9 female 1
4 40.1 female 1
5 42.7 female 1
6 38.1 female 1
7 36.7 female 1
8 40.6 female 1
9 39.2 female 1
10 36.6 female 1
# ℹ 39,990 more rows
With the infer 📦 we
specify() to say which variables are involved in the test, and which is the response and which the explanatory variable, then
hypothesize() to indicate what we are testing — here the independence of bill_length_mm and sex, and then
generate() to create 1,000 permutations of the data, which are adelie_shuffled

We also wrapped all that in withr::with_seed(), which is another way of setting the random seed for the RNG
Now we have 1,000 randomly shuffled — permuted — data sets
Our null hypothesis (\(\text{H}_0\)) is that there is
no difference in the mean bill length of male and female Adelie penguins
For each permuted data set we need to calculate the
mean bill_length_mm for the males — \(\hat{\mu}_{\text{Males}}\)
mean bill_length_mm for the females — \(\hat{\mu}_{\text{Females}}\)

This difference of means will be our test statistic
We could do those calculations using dplyr but infer makes it even easier via calculate()
"diff in means" is how we indicate we want a difference of meansorder indicates the order of the means in the differenceThe order doesn’t matter, so long as we are consistent
We did \(\hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}}\) so we will put male first
This generates the null distribution of the test statistic
The null distribution shows the distribution of the test statistic under \(\text{H}_0\) of no effect
We also need to compute the observed value of the test statistic (i.e. the one for our data)
Do this using the same pipeline, without the hypothesize() and generate() steps
If we add the observed value to the plot we can see how typical our observed difference of means is of the values we would expect if there were no effect of sex
Whether a test is one- or two-tailed is a common distinction in research studies
Relates to the alternative hypothesis:
This is where concept of one- or two-tailed tests comes from; in which tail do we find the rejection region
If we add the observed value to the plot we can see how typical our observed difference of means is of the values we would expect if there were no effect of sex
We asked for direction = "both" so the alternative hypothesis is that the males and females have different mean bill lengths
We typically denote the alternative hypothesis as \(\text{H}_1\) or \(\text{H}_{\text{a}}\)
This hypothesis is usually what we are interested in knowing about, but we can’t test it explicitly
\(\text{H}_{\text{a}}\) could be:
Option 1 is a two-sided test so we set direction to be "two-sided" or "both"
Option 2 is a one-sided test so we set direction to be "greater" or "right"
Option 3 is also a one-sided test but we set direction to be "less" or "left"
(for order = c("male", "female")!)
Before observing the penguins, if we had hypothesized that males had longer bills than females, we could have done a one-sided test of the hypothesis
male Adelie penguins have longer bill lengths on average than females
\[\begin{align*} \text{H}_{\text{0}} & : \mu_{\text{Males}} = \mu_{\text{Females}} \\ \text{H}_{\text{a}} & : \mu_{\text{Males}} > \mu_{\text{Females}} \end{align*}\]
It might be easier to express our hypotheses in terms of the difference of means
\[\begin{align*} \text{H}_{\text{0}} & : \mu_{\text{Males}} - \mu_{\text{Females}} = 0 \\ \text{vs H}_{\text{a}} & : \mu_{\text{Males}} - \mu_{\text{Females}} \neq 0 \end{align*}\]
"less"\[\begin{align*} \text{H}_{\text{0}} & : \hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}} = 0 \\ \text{vs H}_{\text{a}} & : \hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}} < 0 \end{align*}\]
"greater"\[\begin{align*} \text{H}_{\text{0}} & : \hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}} = 0 \\ \text{vs H}_{\text{a}} & : \hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}} > 0 \end{align*}\]
Using the null distribution we can obtain a p-value
the probability that a test statistic as or more extreme than the observed statistic would occur if the null hypothesis were true
We count how many differences of means in the null distribution are as large as, or larger than (in absolute value),
the observed difference of means \(\hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}}\)
Why “in absolute value” for the two-sided test?
In our case, the p-value is reported to be 0
# p value for the simpler alternative of the bill lengths are different
adelie_null_distribution |>
  get_p_value(obs_stat = obs_diff_means, direction = "two-sided")
# A tibble: 1 × 1
p_value
<dbl>
1 0
# p value for the stronger alternative of male bill lengths are larger
adelie_null_distribution |>
  get_p_value(obs_stat = obs_diff_means, direction = "greater")
# A tibble: 1 × 1
p_value
<dbl>
1 0
A p-value of exactly 0 doesn’t make sense, so instead we say that \(p\) is less than the smallest value we could have detected
\[\text{p value} = p < \frac{1}{P} = 0.001\]
\(P\) is the number of permuted data sets we generated (1,000 here)
This is the basis of Null Hypothesis Significance Testing or NHST
The null hypothesis is the situation we test; no difference in means between groups, no relationship between x and y
Look for evidence against this null hypothesis; unlikely values of statistics if the null hypothesis were true
We permuted the data to generate the null distribution of the test statistic
the sampling distribution of the test statistic under \(\text{H}_0\)
Much theoretical work in statistics is in deriving sampling distributions of different statistics, or types of statistics, using maths instead of resampling (permuting)
The standard normal distribution is a Gaussian distribution with \(\mathsf{\mu = 0}\) & \(\mathsf{\sigma^2 = 1}\)
People often call values from this distribution z-scores
Calculate z-scores by subtracting the mean score from each score and dividing by \(\sigma\)
\[z_i = \mathsf{\frac{\text{score}_i - \text{mean}}{\text{standard deviation}}}\]
\[z_i = \mathsf{\frac{14 - \text{14.95}}{\text{5.47}}} = \mathsf{-0.17}\]
Negative z-scores lie below the sample mean, positive scores above it
A z-score of -22.52 tells us that the score of 2mm is 22.52 standard deviations below the mean
68% probability a z-score lies between -1 and 1; 95% probability a z-score lies between -1.96 and 1.96
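The z-score arithmetic above is straightforward in base R (the mean of 14.95 and standard deviation of 5.47 are the values used in the worked example):

```r
# z-score: how many standard deviations a score lies from the sample mean
z_score <- function(score, mean, sd) (score - mean) / sd

z_score(14, mean = 14.95, sd = 5.47)   # about -0.17: just below the mean
```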
Probability distributions, like the standard normal or t distribution, help us compute the probability of obtaining a statistic if the null hypothesis were true
This is the probability due to chance
Under standard normal, probability of observing a score of 2.5 is 0.006
Under t distribution with 5 df, probability of observing a score of 2.5 is 0.027
The t distribution is often used for small samples and where the sample variance is estimated (not known), which for many samples will be an underestimate of the true variance
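These tail probabilities can be read off directly in base R, using the values quoted above:

```r
# P(Z > 2.5) under the standard normal distribution
pnorm(2.5, lower.tail = FALSE)        # about 0.006

# P(T > 2.5) under a t distribution with 5 degrees of freedom
pt(2.5, df = 5, lower.tail = FALSE)   # about 0.027

# The t distribution has heavier tails, so extreme values are more probable
```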
The p-value tells us nothing about the probability of the null hypothesis or the alternative hypothesis
Redrawn from Krzywinski & Altman (2013) Significance, P values and t-tests. Nature Methods 10(11) 1041–1042 doi: 10.1038/nmeth.2698
Whether a test is one- or two-tailed is a common distinction in research studies
Relates to the alternative hypothesis:
This is where concept of one- or two-tailed tests comes from; in which tail do we find the rejection region
The classical test for a difference of means is known as a two-sample t test
Two Sample t-test
data: bill_length_mm by sex
t = -5.4197, df = 38, p-value = 3.557e-06
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-4.374686 -1.995314
sample estimates:
mean in group female mean in group male
37.355 40.540
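The output above comes from R's t.test(). A self-contained sketch on simulated data (the group means and spread are assumed values for illustration, not the real penguin measurements):

```r
# Two-sample t test, assuming equal variances in the two groups
set.seed(1)
female <- rnorm(20, mean = 37.4, sd = 1.9)   # assumed "female" bill lengths
male   <- rnorm(20, mean = 40.5, sd = 1.9)   # assumed "male" bill lengths

tt <- t.test(male, female, var.equal = TRUE)
tt$p.value   # very small: clear evidence of a difference in means
```

With the real data, the equivalent call uses the formula interface: t.test(bill_length_mm ~ sex, data = adelie_bills, var.equal = TRUE).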
The two-sample t test is a special case of a linear model
Call:
lm(formula = bill_length_mm ~ sex, data = adelie_bills)
Residuals:
Min 1Q Median 3Q Max
-4.2400 -1.0262 0.0025 0.8350 4.8450
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.3550 0.4155 89.89 < 2e-16 ***
sexmale 3.1850 0.5877 5.42 3.56e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.858 on 38 degrees of freedom
Multiple R-squared: 0.436, Adjusted R-squared: 0.4211
F-statistic: 29.37 on 1 and 38 DF, p-value: 3.557e-06
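The equivalence can be checked directly: the coefficient for the group indicator in lm() equals the difference of the group means. A sketch on simulated data (assumed values, as before):

```r
# The sexmale coefficient from lm() is exactly the male - female mean difference
set.seed(1)
d <- data.frame(
  bill = c(rnorm(20, 37.4, 1.9), rnorm(20, 40.5, 1.9)),
  sex  = factor(rep(c("female", "male"), each = 20))
)

m <- lm(bill ~ sex, data = d)

coef(m)[["(Intercept)"]]   # the mean bill length of the reference group (female)
coef(m)[["sexmale"]]       # the male - female difference of group means
```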
The two-sample t test is a special case of a linear model
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
3.19 0.588 5.42 <0.001 24.0 2.03 4.34
Term: sex
Type: response
Comparison: mean(male) - mean(female)
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
We will focus on linear models in this course
In NHST we obtain the probability (\(p\)) of observing a statistic as extreme as the observed one if the null hypothesis were true: evidence against the null
If \(p\) is small we reject the null hypothesis & accept the alternative: there is a chance that we falsely reject the null (a Type I error)
If \(p\) is large we fail to reject the null hypothesis & cannot accept the alternative: there is a chance that we falsely fail to reject the null (a Type II error)
Randall Munroe http://xkcd.com/1478/
A major problem with people’s use of NHST is the over-emphasis put on the p-value
The p-value is just a measure of conditional probability; a low p-value doesn’t mean the effect is important
Increase the sample size and, all else being equal, ever smaller effects become detectable (statistically significant)
It is important to ask whether the effect size is important or relevant
Effect size refers to the magnitude of the effect of interest, e.g. the difference in group means
Statistical power is the ability of a study to detect a real effect
Power is a function of the Type II error rate, \(\beta\):
\[\mathsf{power} = 1 - \beta\]
Power ranges from 0 (no chance of detecting the real effect) to 1 (100% chance of detecting the real effect)
There is no point in conducting a study, especially one involving animals, that has a low chance of finding the real effect
Power is a function of several important factors
For simple models (t-tests, one-way ANOVA, etc) there are relatively simple equations for computing the power of a planned study for given sample size, effect size etc.
For more complex models, power calculations become difficult; simulation using a computer is a solution
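A minimal simulation-based power calculation in base R. The effect size, spread, and sample size here are illustrative; for this simple two-group design, power.t.test() gives the exact answer to compare against:

```r
# Estimate power: the proportion of simulated studies that reject H0 at alpha = 0.05
set.seed(42)
sim_power <- function(n, delta, sd, reps = 1000, alpha = 0.05) {
  rejections <- replicate(reps, {
    x <- rnorm(n, mean = 0,     sd = sd)   # control group
    y <- rnorm(n, mean = delta, sd = sd)   # treated group
    t.test(x, y, var.equal = TRUE)$p.value < alpha
  })
  mean(rejections)                         # share of simulations rejecting H0
}

pw <- sim_power(n = 20, delta = 1.5, sd = 2)
pw   # around 0.6; compare with power.t.test(n = 20, delta = 1.5, sd = 2)$power
```

The same recipe extends to designs where no closed-form power formula exists: simulate data under the assumed effect, fit the intended model, and record how often the test rejects.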
Diagnostic testing — say for SARS-CoV-2 — is typically thought of as giving a binary result: positive or negative
What does “true positive” mean? Carrying the virus, infectious, …?
What a “positive” result means depends on the type of test (PCR or rapid lateral flow)
All tests can give a “wrong” result
A false negative is a negative test result when the patient has the disease
A false positive is a positive test result when the patient doesn’t have the disease
Two-way table
| | Disease | No disease |
|---|---|---|
| + test | TP | FP |
| - test | FN | TN |
We describe the accuracy of test results through two quantities
the proportion of people with the virus who get a positive test result
We also call this sensitivity
the proportion of people not infected with the virus who get a negative test result
We also call this specificity
Confusingly, however, we typically use the complements of these values
False positive rate: 1 - specificity
False negative rate: 1 - sensitivity
At the end of June 2020 in the UK less than 0.05% of PCR test results were positive
Even if no virus were circulating, the estimated false positive rate could be no larger than 0.05%
For every 2,000 people without the virus, we would expect just one to test positive
It was also estimated that between 5% and 15% of test results for people with the virus were false negatives (FN)
Similar values could be reported for lateral flow devices
A positive test result doesn’t mean you are infected
What’s more important for people is the probability that you are infected if you test positive
More generally, what does a particular test result mean for an individual?
These are predictions, conditional upon a given test result
Suppose a lateral flow test has a TPR of 60% and a TNR of 99.9% (an FPR of 1 in 1,000)
Suppose further that the virus is at the levels of early March 2021, when ~0.3% of the UK population had an infection
What happens if we test 1,000,000 people?
3,000 people have the disease (0.3% of 1,000,000) of whom 1,800 test positive (60% of 3,000)
997,000 people don’t have the disease but 997 (0.1%) get a false positive test result
The FPR is 1:1,000, yet 36% (997 / (1,800 + 997)) of positive test results are false
flowchart LR
poptested["1,000,000"]
disease["3,000"]
nodisease["997,000"]
pop((Population)) --> |tested | poptested
poptested --> | disease | disease
poptested --> | no disease | nodisease
disease --> test1{Test}
test1 --> | positive | tp["1,800"]
test1 --> | negative | fn["1,200"]
nodisease --> test2{Test}
test2 --> | positive | fp["997"]
test2 --> | negative | tn["996,003"]
tp --> tps[TP]
fn --> fns[FN]
fp --> fps[FP]
tn --> tns[TN]
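The arithmetic in the flowchart can be reproduced exactly, using the numbers from the example above (0.3% prevalence, 60% TPR, 99.9% TNR):

```r
# Expected counts when testing 1,000,000 people
n    <- 1e6
prev <- 0.003    # 0.3% prevalence
tpr  <- 0.60     # sensitivity (true positive rate)
tnr  <- 0.999    # specificity (true negative rate); FPR = 0.001

tp <- n * prev * tpr               # true positives:    1,800
fn <- n * prev * (1 - tpr)         # false negatives:   1,200
fp <- n * (1 - prev) * (1 - tnr)   # false positives:     997
tn <- n * (1 - prev) * tnr         # true negatives:  996,003

# Share of positive results that are false
fp / (tp + fp)                     # about 0.36
```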
The FPR is 1:1,000, yet 36% (997 / (1,800 + 997)) of positive test results are false
This is confusing to many people
as a disease gets rarer, more positive test results will be false
This is an application of Bayes’ Theorem
Shows how important the underlying prevalence of disease is in interpreting test results
Mammography is ~90% accurate for screening breast cancer
Assuming 1% of screened women have breast cancer
flowchart LR
popscreened["1,000"]
cancer["10"]
nocancer["990"]
pop((Population)) --> | screened | popscreened
popscreened --> | cancer | cancer
popscreened --> | no cancer | nocancer
cancer --> test1{Test}
test1 --> | positive | tp[9]
test1 --> | negative | fn[1]
nocancer --> test2{Test}
test2 --> | positive | fp[99]
test2 --> | negative | tn[891]
tp --> tps[TP]
fn --> fns[FN]
fp --> fps[FP]
tn --> tns[TN]
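The same calculation, written out as Bayes’ Theorem with the screening example’s numbers (90% sensitivity and specificity, 1% prevalence, as in the flowchart):

```r
# P(cancer | positive test) via Bayes' Theorem
prev <- 0.01   # 1% of screened women have breast cancer
sens <- 0.90   # P(positive | cancer)
spec <- 0.90   # P(negative | no cancer)

ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
ppv   # about 0.083: roughly 1 in 12 positive results is a true positive
```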
Shows the harm screening can cause if underlying risk is low
Evidence-based veterinary medicine is the formal strategy of integrating the best available research evidence with clinical expertise and the unique needs or wishes of each client in clinical practice. Much of it is based on results from research studies that have been carefully designed and statistically evaluated.
Current gold standard is the randomised controlled trial (RCT)
You need to develop a range of skills as a veterinarian
Statistical literacy — data literacy
Can you read and appraise the literature?